
Conversation

@anmyachev (Contributor) commented Dec 6, 2024

…our patch for elapsed_time

Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev (Contributor Author)

@chengjunlu there are many messages like "warning: Double arithmetic operation is not supported on this platform with FP64 conversion emulation mode (poison FP64 kernels is enabled)." in https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12196498158/job/34024201713. They really clutter the logs without providing much information. It would be great to suppress them somehow. Is this possible?

@anmyachev (Contributor Author)

@pbchekin do you know why? (from https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12196498158)

FAILED [4.7235s] inductor/test_triton_kernels.py::CustomOpTests::test_autotune_unbacked - torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/intel/oneapi/bin/icpx'

@pbchekin (Contributor) commented Dec 6, 2024

> @pbchekin do you know why? (from https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12196498158)
>
> FAILED [4.7235s] inductor/test_triton_kernels.py::CustomOpTests::test_autotune_unbacked - torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
> FileNotFoundError: [Errno 2] No such file or directory: '/opt/intel/oneapi/bin/icpx'

There is no such file, indeed. This file is located at:

  • /opt/intel/oneapi/compiler/2024.1/bin/icpx, /opt/intel/oneapi/pytorch-gpu-dev-0.5/bin/icpx in PTDB
  • /opt/intel/oneapi/2025.0/bin/icpx, /opt/intel/oneapi/compiler/2025.0/bin/icpx in DLE

Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev (Contributor Author)

> > @pbchekin do you know why? (from https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12196498158)
> >
> > FAILED [4.7235s] inductor/test_triton_kernels.py::CustomOpTests::test_autotune_unbacked - torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
> > FileNotFoundError: [Errno 2] No such file or directory: '/opt/intel/oneapi/bin/icpx'
>
> There is no such file, indeed. This file is located at:
>
>   • /opt/intel/oneapi/compiler/2024.1/bin/icpx, /opt/intel/oneapi/pytorch-gpu-dev-0.5/bin/icpx in PTDB
>   • /opt/intel/oneapi/2025.0/bin/icpx, /opt/intel/oneapi/compiler/2025.0/bin/icpx in DLE

I found the reason: it's because of how PyTorch searches for the SYCL home directory, and it probably relates to pytorch/pytorch@4742080.

Ref to PyTorch: https://github.com/pytorch/pytorch/blame/5872a8c6b00a5c9e45ac4bc99a5c87b93a93aa94/torch/utils/cpp_extension.py#L147

def _find_sycl_home() -> Optional[str]:
    """Find the OneAPI install path."""
    # Guess #1
    sycl_home = os.environ.get('ONEAPI_ROOT')
    if sycl_home is None:
        # Guess #2
        icpx_path = shutil.which('icpx')
        if icpx_path is not None:
            sycl_home = os.path.dirname(os.path.dirname(
                os.path.realpath(icpx_path)))

    if sycl_home and not torch.xpu.is_available():
        print(f"No XPU runtime is found, using ONEAPI_ROOT='{sycl_home}'",
              file=sys.stderr)
    return sycl_home
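
As an aside, Guess #2 above can be exercised on its own; a minimal sketch (assuming the oneAPI environment has been activated so that icpx is on PATH):

import os
import shutil

# Mirrors "Guess #2" from _find_sycl_home: resolve icpx on PATH and walk
# two directories up to infer the oneAPI home. Assumes the oneAPI
# environment was activated beforehand (e.g. via setvars.sh).
icpx_path = shutil.which("icpx")
if icpx_path is not None:
    sycl_home = os.path.dirname(os.path.dirname(os.path.realpath(icpx_path)))
    print(f"icpx: {icpx_path} -> inferred home: {sycl_home}")
else:
    print("icpx not found on PATH; ONEAPI_ROOT would be the only hint")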

@anmyachev (Contributor Author)

@pbchekin any chance that ONEAPI_ROOT for PTDB is /opt/intel/oneapi/2025.0 instead of /opt/intel/oneapi (as it is for DLE)? That would explain the error.

My guess comes from pytorch/pytorch#142242 (comment).

@pbchekin (Contributor) commented Dec 7, 2024

> @pbchekin any chance that ONEAPI_ROOT for PTDB is /opt/intel/oneapi/2025.0 instead of /opt/intel/oneapi (as it is for DLE)? That would explain the error.
>
> My guess comes from pytorch/pytorch#142242 (comment).

Nope:

$ source /opt/intel/oneapi/setvars.sh
...
$ echo $ONEAPI_ROOT
/opt/intel/oneapi
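
This lines up with the traceback: a hypothetical sketch (the caller logic below is illustrative, not taken from PyTorch) of how joining this ONEAPI_ROOT with bin/icpx yields exactly the missing path, even though the binary actually lives in a versioned subdirectory:

import os

# Hypothetical illustration: joining ONEAPI_ROOT (as reported above) with
# "bin/icpx" reproduces the path from the FileNotFoundError, while the
# real binary sits under e.g. /opt/intel/oneapi/compiler/2025.0/bin/icpx.
oneapi_root = "/opt/intel/oneapi"
print(os.path.join(oneapi_root, "bin", "icpx"))  # /opt/intel/oneapi/bin/icpx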

Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev (Contributor Author)

Inductor CI with changes from PR 2962: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12221768210

@chengjunlu (Contributor)

> @chengjunlu there are many messages like "warning: Double arithmetic operation is not supported on this platform with FP64 conversion emulation mode (poison FP64 kernels is enabled)." in https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12196498158/job/34024201713. They really clutter the logs without providing much information. It would be great to suppress them somehow. Is this possible?

This warning is a DPC++ feature. Let me check with the torch team how to disable it.

@anmyachev (Contributor Author) commented Dec 9, 2024

Hi @guangyey!

After this change (pytorch/pytorch#135567), our tutorials started to fail with a RuntimeError: Overflow when unpacking long exception (ref: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12239696322/job/34140774158?pr=2952).

A quick search through the PyTorch codebase gave me the idea that the problem is in the infer_scalar_type function (https://github.com/pytorch/pytorch/blob/90fc2b42e3e2d51b26a96df0dff4a644e218f8ab/torch/csrc/utils/tensor_new.cpp#L148), which returns ScalarType::Long instead of ScalarType::Uint64. Could you take a look?

Example:

# Each pointer is obtained through the tensor.data_ptr() method.
d_a_ptrs = torch.tensor([18374686479673720832, 18374967954644140032, 18374967954645188608, 18374967954645450752], device=device) # <- failed

Stack trace:

#0  0x00007fffed0824a1 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007fffec20b246 in THPUtils_unpackLong(_object*) () from .../intel-xpu-backend-for-triton/.scripts_cache/pytorch/torch/lib/libtorch_python.so
#2  0x00007fffec9b7911 in torch::utils::store_scalar(void*, c10::ScalarType, _object*) ()
   from .../intel-xpu-backend-for-triton/.scripts_cache/pytorch/torch/lib/libtorch_python.so
#3  0x00007fffec9c1af8 in torch::utils::(anonymous namespace)::recursive_store(char*, c10::ArrayRef<long>, c10::ArrayRef<long>, long, c10::ScalarType, unsigned long, _object*) [clone .isra.0] () from .../intel-xpu-backend-for-triton/.scripts_cache/pytorch/torch/lib/libtorch_python.so
#4  0x00007fffec9c3460 in torch::utils::(anonymous namespace)::internal_new_from_data(c10::TensorOptions, c10::ScalarType, std::optional<c10::Device>, _object*, bool, bool, bool, bool) () from .../intel-xpu-backend-for-triton/.scripts_cache/pytorch/torch/lib/libtorch_python.so
#5  0x00007fffec9c8d77 in torch::utils::tensor_ctor(c10::DispatchKey, c10::ScalarType, torch::PythonArgs&) ()
   from .../intel-xpu-backend-for-triton/.scripts_cache/pytorch/torch/lib/libtorch_python.so
#6  0x00007fffec4ed8f2 in torch::autograd::THPVariable_tensor(_object*, _object*, _object*) ()
   from .../intel-xpu-backend-for-triton/.scripts_cache/pytorch/torch/lib/libtorch_python.so
#7  0x00005555556985a6 in cfunction_call (func=0x7ffff7631800, args=<optimized out>, kwargs=<optimized out>)

Simplified example:

import torch

test = torch.rand((10, 10), device="xpu", dtype=torch.float16)
test_ptr = test.data_ptr()
torch.tensor(test_ptr, device="xpu")  # <- RuntimeError: Overflow when unpacking long

@anmyachev (Contributor Author)

I decided to try specifying the dtype directly, as suggested in pytorch/pytorch#135628. @guangyey, do I understand correctly that this approach is now the recommended one in the code?

anmyachev marked this pull request as ready for review on December 9, 2024 19:47
@guangyey commented Dec 10, 2024

> I decided to try specifying the dtype directly, as suggested in pytorch/pytorch#135628. @guangyey, do I understand correctly that this approach is now the recommended one in the code?

In my understanding, specifying the dtype is the correct approach. This is because we changed data_ptr from int64 to uint64 in pytorch/pytorch#135567, while the default integer tensor dtype is int64. So users need to be aware that they should specify dtype=torch.uint64 if they pass a value that overflows int64.

>>> b = torch.tensor([1,2])
>>> b.dtype
torch.int64
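
Accordingly, a minimal sketch of the suggested fix applied to the repro above (assuming an XPU device is available and a PyTorch build that supports torch.uint64):

import torch

test = torch.rand((10, 10), device="xpu", dtype=torch.float16)
test_ptr = test.data_ptr()
# Explicit unsigned dtype: data_ptr() can exceed the int64 range, so the
# default int64 inference would raise "Overflow when unpacking long".
ptr_tensor = torch.tensor(test_ptr, dtype=torch.uint64, device="xpu")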

@guangyey left a comment

LGTM.

anmyachev merged commit 3ccab57 into main on Dec 10, 2024
5 checks passed
anmyachev deleted the amyachev/issue2945 branch on December 10, 2024 11:29

Development

Successfully merging this pull request may close these issues.

[Pytorch pin update] Update PyTorch pin and deprecate elapsed_time patch